In GPU acceleration, we must abandon the "compute-first" mindset. Modern performance is dictated by memory management: the orchestration of data allocation, placement, synchronization, and movement between the host (CPU) and device (GPU).
1. The Memory-Compute Disparity
While GPU arithmetic throughput (TFLOP/s) has skyrocketed, memory bandwidth (GB/s) has grown far more slowly. This creates a gap where the execution units are often "starved," idling while they wait for data to arrive from VRAM. Consequently, GPU programming is often memory programming.
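To make the disparity concrete, here is a minimal back-of-envelope sketch. The specs are assumptions for a hypothetical GPU (50 TFLOP/s peak FP32 compute, 1 TB/s memory bandwidth), not measurements of any specific device; the kernel modeled is a simple vector add.

```python
# Hypothetical GPU specs (assumptions, not a real device):
PEAK_FLOPS = 50e12   # 50 TFLOP/s peak FP32 throughput
BANDWIDTH = 1e12     # 1 TB/s VRAM bandwidth

# Vector add c = a + b over N fp32 elements:
# 1 FLOP per element, 12 bytes moved per element
# (read a: 4 B, read b: 4 B, write c: 4 B).
N = 1 << 28
flops = N
bytes_moved = 12 * N

t_compute = flops / PEAK_FLOPS   # time if only math mattered
t_memory = bytes_moved / BANDWIDTH  # time to stream the data

print(f"compute time: {t_compute * 1e3:.3f} ms")
print(f"memory time:  {t_memory * 1e3:.3f} ms")
print(f"memory/compute ratio: {t_memory / t_compute:.0f}x")  # 600x
```

For this kernel the execution units spend roughly 600 times longer waiting on memory than doing arithmetic, which is exactly the "starvation" described above.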
2. The Roofline Model
This model visualizes the relationship between Arithmetic Intensity (FLOPs/Byte) and performance. Applications typically fall into two categories:
- Memory-Bound: Limited by bandwidth (the sloped portion of the roofline).
- Compute-Bound: Limited by peak TFLOP/s (the horizontal ceiling).
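The roofline itself is just one formula: attainable performance is the minimum of peak compute and arithmetic intensity times bandwidth. The sketch below uses the same hypothetical 50 TFLOP/s / 1 TB/s machine as an assumption; the "ridge point" is the intensity at which a kernel transitions from the sloped (memory-bound) region to the flat (compute-bound) ceiling.

```python
# Hypothetical machine (assumptions): 50 TFLOP/s peak, 1 TB/s bandwidth.
PEAK_FLOPS = 50e12
BANDWIDTH = 1e12

def attainable_tflops(ai):
    """Roofline: perf = min(peak compute, AI * bandwidth), in TFLOP/s.

    ai is arithmetic intensity in FLOPs per byte moved.
    """
    return min(PEAK_FLOPS, ai * BANDWIDTH) / 1e12

# Ridge point: intensity where the slope meets the ceiling.
ridge = PEAK_FLOPS / BANDWIDTH  # 50 FLOPs/byte here

print(attainable_tflops(1 / 12))  # vector add: deep in memory-bound territory
print(attainable_tflops(100.0))   # high-reuse kernel: hits the 50 TFLOP/s ceiling
```

Note how far the vector add (about 0.08 FLOPs/byte) sits below the 50 FLOPs/byte ridge point: no amount of compute optimization helps it, only reducing bytes moved does.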
3. The Tax of Data Movement
The primary performance bottleneck is rarely the math; it is the latency and energy cost of moving each byte across the PCIe bus or out of HBM. High-performance code keeps data resident on the device and minimizes host-device transfers.
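A rough model makes the transfer tax visible. The bandwidth figures are assumptions (roughly PCIe 4.0 x16 for the host link, 1 TB/s for device HBM); the point is the ratio, and how keeping data resident amortizes the one-time upload across many kernel launches.

```python
# Assumed bandwidths, not measurements:
PCIE_BW = 32e9   # bytes/s, roughly a PCIe 4.0 x16 link
HBM_BW = 1e12    # bytes/s, device memory
DATA = 1e9       # 1 GB working set

t_pcie = DATA / PCIE_BW  # one host -> device transfer
t_hbm = DATA / HBM_BW    # one full on-device pass over the data

print(f"PCIe transfer is {t_pcie / t_hbm:.2f}x slower than an HBM pass")

def time_per_kernel(k):
    """Average cost per kernel if one upload is reused by k kernels."""
    return (t_pcie + k * t_hbm) / k

# As k grows, the per-kernel cost falls toward the pure HBM time:
print(time_per_kernel(1))    # transfer dominates
print(time_per_kernel(100))  # transfer nearly amortized away
```

This is why "upload once, compute many times" is the standard pattern: re-transferring the working set for every kernel pays the 31x PCIe penalty each launch, while resident data pays it once.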